QMSS5015 Lab 2¶

October 12, 2024¶

Sam Anwar¶

The dataset: I'm continuing with the dataset I used for Lab 1: 2015 NYC Yellow Taxi rides.

I pull the data in via API access to the City of New York's Open Data repository because the full file (over 14 million rows) is too large to store locally. I cap the request at 50,000 rows, and the exact rows returned can differ from one pull to the next. I tried to keep my observations general with this in mind, but if you re-run this notebook with a new kernel, your descriptive statistics may differ slightly from the ones I write about in these markdown cells.

I use the Plotly library for my visualizations. They're all interactive! Please explore the data dynamically.

Data source: https://data.cityofnewyork.us/Transportation/2015-Yellow-Taxi-Trip-Data/2yzn-sicd/about_data

In [2]:
# import all necessary libraries
import pandas as pd
import plotly.express as px
import json
import geopandas as gpd
from shapely.geometry import Point

Question 1: Recode 2 different variables into new categories. They can both be continuous-ish or both be nominal-ish, or one of each. Tell me what you did and explain the variable(s).¶

For the first variable, I want to recode pick up lat/longs into neighborhoods. This will involve transforming the lat/longs into a geometric object, and then merging with a geoJSON file which contains polygons of each NYC neighborhood. I'm interested in this because the lat/longs on their own are not easy to interpret. Grouping them by neighborhood will lead to more interesting analysis.

I also recode several neighborhoods as "Other" if their frequency is low.

GeoJSON file source here: https://github.com/HodgesWardElliott/custom-nyc-neighborhoods
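
Before running it on the real data, the mechanics of that point-in-polygon join can be sketched on toy geometry (two unit squares standing in for neighborhoods; everything here is made up for illustration):

```python
import geopandas as gpd
from shapely.geometry import Point, box

# two toy "neighborhoods": adjacent unit squares
hoods = gpd.GeoDataFrame(
    {'neighborhood': ['West', 'East']},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)], crs='EPSG:4326')

# two toy pickup points, one inside each square
pickups = gpd.GeoDataFrame(
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5)], crs='EPSG:4326')

# left join keeps every pickup, attaching the polygon it falls within
joined = gpd.sjoin(pickups, hoods, how='left', predicate='within')
print(joined['neighborhood'].tolist())  # ['West', 'East']
```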

In [5]:
# read in the data
df = pd.read_csv('https://data.cityofnewyork.us/resource/2yzn-sicd.csv?$limit=50000') # default limit value is 1000 so set to 50,000

df = df.dropna(subset=['ratecodeid', 'pickup_latitude', 'pickup_longitude', 
                       'dropoff_latitude', 'dropoff_longitude', 'total_amount']) # drop rows if any of these cols of interest are NA

df.drop(df[df['pickup_latitude'] == 0].index, inplace=True) # drop if pickup_latitude is 0 (data error)

df.drop(df[df['total_amount'] < 0].index, inplace = True) # drop if total amount paid is negative (data error)

df = df.reset_index(drop=True) # reset index

df.head()
Out[5]:
passenger_count trip_distance pickup_longitude pickup_latitude ratecodeid store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount extra mta_tax tip_amount tolls_amount total_amount vendor_id pickup_datetime dropoff_datetime rate_code payment_type imp_surcharge
0 1 1.75 -73.981003 40.759609 1 N -73.991791 40.746620 11.0 0.5 0.5 0.00 0.00 12.30 2 2015-08-21T23:14:17.000 2015-08-21T23:30:20.000 NaN 2 0.3
1 2 1.55 -73.991776 40.750240 1 N -73.981514 40.762455 7.5 0.0 0.5 0.00 0.00 8.30 2 2015-08-01T10:59:47.000 2015-08-01T11:08:12.000 NaN 2 0.3
2 1 15.10 -73.983566 40.749882 2 N -73.804558 40.649757 52.0 0.0 0.5 11.65 5.54 69.99 1 2015-08-17T14:55:42.000 2015-08-17T15:34:57.000 NaN 1 0.3
3 1 2.90 -74.000458 40.727295 1 N -73.979805 40.761723 10.5 0.5 0.5 0.00 0.00 11.80 1 2015-08-08T02:51:51.000 2015-08-08T03:01:14.000 NaN 2 0.3
4 1 15.90 -73.972786 40.749996 2 N -73.786385 40.644257 52.0 0.0 0.5 11.67 5.54 70.01 2 2015-08-30T08:15:38.000 2015-08-30T08:34:47.000 NaN 1 0.3
In [7]:
# read in geoJSON file of NYC neighborhoods
url = 'https://raw.githubusercontent.com/HodgesWardElliott/custom-nyc-neighborhoods/refs/heads/master/custom-pedia-cities-nyc-Mar2018.geojson'

neighborhoods = gpd.read_file(url)
In [9]:
# need to turn lat/lon in the original df into geometric objects
# points_from_xy builds the geometry column and converts df to a geodf in one step
gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df['pickup_longitude'], df['pickup_latitude']))

# make sure both geodfs are at the same coordinate reference system
gdf.set_crs(epsg=4326, inplace=True)
neighborhoods = neighborhoods.to_crs(epsg=4326)

# spatial join the geometric point from og dataframe with neighborhoods file
# now each coordinate is mapped to a neighborhood
joined_gdf = gpd.sjoin(gdf, neighborhoods, how='left', predicate='within') # 'predicate' replaces the deprecated 'op' parameter

joined_gdf.head()
Out[9]:
passenger_count trip_distance pickup_longitude pickup_latitude ratecodeid store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount extra ... dropoff_datetime rate_code payment_type imp_surcharge geometry index_right neighborhood boroughCode borough X.id
0 1 1.75 -73.981003 40.759609 1 N -73.991791 40.746620 11.0 0.5 ... 2015-08-21T23:30:20.000 NaN 2 0.3 POINT (-73.98100 40.75961) 278.0 Theater District 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho...
1 2 1.55 -73.991776 40.750240 1 N -73.981514 40.762455 7.5 0.0 ... 2015-08-01T11:08:12.000 NaN 2 0.3 POINT (-73.99178 40.75024) 51.0 Chelsea 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho...
2 1 15.10 -73.983566 40.749882 2 N -73.804558 40.649757 52.0 0.0 ... 2015-08-17T15:34:57.000 NaN 1 0.3 POINT (-73.98357 40.74988) 187.0 Midtown 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho...
3 1 2.90 -74.000458 40.727295 1 N -73.979805 40.761723 10.5 0.5 ... 2015-08-08T03:01:14.000 NaN 2 0.3 POINT (-74.00046 40.72729) 126.0 Greenwich Village 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho...
4 1 15.90 -73.972786 40.749996 2 N -73.786385 40.644257 52.0 0.0 ... 2015-08-30T08:34:47.000 NaN 1 0.3 POINT (-73.97279 40.75000) 197.0 Murray Hill 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho...

5 rows × 26 columns

In [11]:
joined_gdf.neighborhood.value_counts()
# let's regroup neighborhoods with low frequency to "Other"
Out[11]:
neighborhood
Midtown            9084
Upper East Side    6064
Chelsea            4859
Upper West Side    3836
Hell's Kitchen     2655
                   ... 
Borough Park          1
Arverne               1
Morrisania            1
Jamaica Estates       1
Rockaway Park         1
Name: count, Length: 122, dtype: int64
In [13]:
joined_gdf = joined_gdf.dropna(subset=['neighborhood'])

freq = joined_gdf.neighborhood.value_counts().to_dict()

neighbor2 = []
for n in joined_gdf.neighborhood:
    if freq[n] < 500:
        neighbor2.append('Other')
    else:
        neighbor2.append(n)

joined_gdf['neighbor_recode'] = neighbor2
joined_gdf.head()
Out[13]:
passenger_count trip_distance pickup_longitude pickup_latitude ratecodeid store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount extra ... rate_code payment_type imp_surcharge geometry index_right neighborhood boroughCode borough X.id neighbor_recode
0 1 1.75 -73.981003 40.759609 1 N -73.991791 40.746620 11.0 0.5 ... NaN 2 0.3 POINT (-73.98100 40.75961) 278.0 Theater District 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... Theater District
1 2 1.55 -73.991776 40.750240 1 N -73.981514 40.762455 7.5 0.0 ... NaN 2 0.3 POINT (-73.99178 40.75024) 51.0 Chelsea 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... Chelsea
2 1 15.10 -73.983566 40.749882 2 N -73.804558 40.649757 52.0 0.0 ... NaN 1 0.3 POINT (-73.98357 40.74988) 187.0 Midtown 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... Midtown
3 1 2.90 -74.000458 40.727295 1 N -73.979805 40.761723 10.5 0.5 ... NaN 2 0.3 POINT (-74.00046 40.72729) 126.0 Greenwich Village 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... Greenwich Village
4 1 15.90 -73.972786 40.749996 2 N -73.786385 40.644257 52.0 0.0 ... NaN 1 0.3 POINT (-73.97279 40.75000) 197.0 Murray Hill 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... Murray Hill

5 rows × 27 columns
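
As a side note, the frequency-based loop in the cell above can also be written vectorized with `Series.map` and `Series.where`; a minimal sketch on made-up data (threshold lowered to 3 for the toy example):

```python
import pandas as pd

s = pd.Series(['Midtown', 'Midtown', 'Midtown', 'Arverne'])

counts = s.map(s.value_counts())         # per-row frequency of each neighborhood
recoded = s.where(counts >= 3, 'Other')  # keep frequent names, lump the rest

print(recoded.tolist())  # ['Midtown', 'Midtown', 'Midtown', 'Other']
```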

In [15]:
# for readability, also shorten some of the neighborhoods
joined_gdf['neighbor_recode'] = joined_gdf['neighbor_recode'].replace(
    {'John F. Kennedy International Airport': 'JFK', 
     'LaGuardia Airport': 'LGA',
     'Stuyvesant Town': 'Stuy Town'})
In [17]:
fig = px.histogram(joined_gdf, x = 'neighbor_recode',
                   height = 600)
fig.update_xaxes(categoryorder = 'total descending')
fig.show()

Observations: Recoding pickup lat/longs into neighborhoods makes the data far easier to interpret, collapsing a pair of continuous coordinates into 24 categorical buckets.

For my second recode, I'll bin total amount paid into the categories below (with right-closed bins, each category includes its upper edge):

  • Low: up to $20
  • Moderate: $20 to $50
  • High: $50 to $100
  • Most Expensive: over $100
In [20]:
joined_gdf.total_amount.describe()
Out[20]:
count    49254.000000
mean        16.134297
std         13.381670
min          0.000000
25%          8.750000
50%         11.800000
75%         17.800000
max        450.310000
Name: total_amount, dtype: float64
In [22]:
paid_labels = ['Low Cost', 'Moderate Cost', 'High Cost', 'Most Expensive']

joined_gdf['cost_tier'] = pd.cut(joined_gdf.total_amount,
                                 bins = [0, 20, 50, 100, 500],
                                 labels = paid_labels,
                                 right = True, ordered = True)
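
One caveat with these bins: because the lowest edge is 0 and the bins are right-closed, a ride with a total_amount of exactly $0 falls outside every interval and gets a NaN cost_tier (a few such rides show up later when sorting by total_amount). A small sketch of the edge case, with include_lowest=True as one way to capture them:

```python
import pandas as pd

amounts = pd.Series([0.0, 8.5, 52.0, 120.0])
labels = ['Low Cost', 'Moderate Cost', 'High Cost', 'Most Expensive']

# default right-closed bins: (0, 20] excludes 0.0, so the first ride becomes NaN
tiers = pd.cut(amounts, bins=[0, 20, 50, 100, 500], labels=labels)

# include_lowest=True widens the first bin to include its left edge
tiers_inc = pd.cut(amounts, bins=[0, 20, 50, 100, 500],
                   labels=labels, include_lowest=True)

print(int(tiers.isna().sum()), int(tiers_inc.isna().sum()))  # 1 0
```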
In [24]:
# look at data via a box plot
box = px.box(joined_gdf, x = 'cost_tier', y = 'total_amount',
             height = 600, width = 800)
box.show()

Observations: The box plot looks a little strange because of the variability in the data. Within the Most Expensive tier, most rides fall below $200, which is where the quartiles sit, but there are many outliers above that box's upper fence.

Question 2: Use one (or both) of your recoded variables to do a cross-tabulation. Explain your results.¶

I'm interested in cross-tabulating both of our recoded variables, neighborhood and cost tier.

In [28]:
pd.crosstab(joined_gdf.cost_tier, joined_gdf.neighbor_recode, normalize='columns').round(4)*100
Out[28]:
neighbor_recode Battery Park City Central Park Chelsea East Village Financial District Flatiron District Gramercy Greenwich Village Hell's Kitchen JFK ... Lower East Side Midtown Murray Hill Other SoHo Theater District Tribeca Upper East Side Upper West Side West Village
cost_tier
Low Cost 61.6 88.51 86.38 83.76 57.33 87.04 84.91 85.34 86.52 6.43 ... 75.89 85.59 85.85 76.33 82.21 82.57 76.80 86.38 87.10 85.00
Moderate Cost 35.4 10.00 11.87 14.84 39.50 11.74 13.49 13.32 11.86 26.25 ... 22.49 12.12 11.77 20.97 16.68 14.02 21.51 12.63 11.60 13.88
High Cost 2.6 1.34 1.61 1.29 3.07 1.10 1.60 1.34 1.51 66.49 ... 1.61 2.20 2.09 2.48 1.11 3.32 1.68 0.94 1.28 1.02
Most Expensive 0.4 0.15 0.14 0.10 0.10 0.12 0.00 0.00 0.11 0.83 ... 0.00 0.09 0.29 0.21 0.00 0.10 0.00 0.05 0.03 0.10

4 rows × 22 columns

In [30]:
# turn crosstab into new df so that we can keep the normalized values
cross = pd.crosstab(joined_gdf.cost_tier, joined_gdf.neighbor_recode, normalize='columns').round(4)*100

# pull the values for each cost tier (one list per bar color)
low = cross.loc['Low Cost'].tolist()
moderate = cross.loc['Moderate Cost'].tolist()
high = cross.loc['High Cost'].tolist()
expensive = cross.loc['Most Expensive'].tolist()

import plotly.graph_objects as go

# create bar chart
# add a bar for each cost category
# fill in the values
fig = go.Figure(data=[
    go.Bar(name = 'Low Cost', x=list(cross.columns), y=low),
    go.Bar(name = 'Moderate Cost', x=list(cross.columns), y=moderate),
    go.Bar(name = 'High Cost', x=list(cross.columns), y=high),
    go.Bar(name = 'Most Expensive', x=list(cross.columns), y=expensive)
])

# change the bar mode
fig.update_layout(barmode='stack', width = 800, height = 600)
fig.show()

Observations: The cross-tabulation shows the distribution of price tiers within each neighborhood (columns are normalized to 100%). As expected, most rides fall into the low cost tier. The two neighborhoods that deviate are the airports, JFK and LGA. Cab rides that begin at JFK have a flat fare of $52 (visible in the rate code 2 rows earlier in this notebook), so it makes sense that most fall into the high cost tier ($50 to $100). LGA is interesting because it arguably has the weakest public transit support of the three airports in the NY metro area. I personally always Uber/taxi home from LGA but will take the train home from JFK, so it anecdotally makes sense that most rides beginning at LGA fall into the moderate cost tier.

Question 3: Run a correlation of one variable with another variable; make all of the recodes necessary to make the correlation as easy to interpret as possible; and explain your results.¶

I'm interested in the correlation between trip distance and tip amount. While it makes sense that longer trips cost more, the tip is at the discretion of the passenger, so it'll be interesting to see if there are any noticeable ceilings.

In [34]:
round(joined_gdf['trip_distance'].corr(joined_gdf['tip_amount']), 3)
Out[34]:
0.468
In [36]:
round(joined_gdf['trip_distance'].corr(joined_gdf['total_amount']), 3)
Out[36]:
0.904
In [38]:
# visualize the data
scatter = px.scatter(joined_gdf.sample(n=500, random_state = 1), # for readability, I plot a random sample of 500 data points
                     x = 'trip_distance', 
                     y = 'tip_amount',
                     height = 600,
                     width = 800)
scatter.show()

Observations: The correlation between trip_distance and tip_amount is 0.468. Out of curiosity, I also looked at the correlation between trip_distance and total_amount as a comparison; the latter is far stronger at 0.904.

Tip amounts are not standardized, so the relationship is weaker. There is also a large number of rides with a $0 tip, which certainly affects the linearity of the relationship.
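
One way to probe that claim is to recompute the correlation with the $0-tip rides removed; a sketch on made-up data (on the real data you would filter joined_gdf the same way):

```python
import pandas as pd

rides = pd.DataFrame({
    'trip_distance': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'tip_amount':    [0.0, 0.0, 0.0, 2.0, 2.5, 3.0],  # zero tips drag the fit down
})

overall = rides['trip_distance'].corr(rides['tip_amount'])

tipped = rides[rides['tip_amount'] > 0]  # drop the $0-tip rides
subset = tipped['trip_distance'].corr(tipped['tip_amount'])

print(round(overall, 3), round(subset, 3))
```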

Question 4: Identify the most extreme cases on some variable. Interpret the results.¶

Let's look at the most expensive and least expensive rides.

In [42]:
# update settings to view all columns
pd.set_option('display.max_columns', None)

# new df with reset index after sort
gdf = joined_gdf.sort_values(by = 'total_amount', ascending = False).reset_index(drop = True)

# view top 3 most expensive rides
gdf.head(3)
Out[42]:
passenger_count trip_distance pickup_longitude pickup_latitude ratecodeid store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount extra mta_tax tip_amount tolls_amount total_amount vendor_id pickup_datetime dropoff_datetime rate_code payment_type imp_surcharge geometry index_right neighborhood boroughCode borough X.id neighbor_recode cost_tier
0 1 0.00 -73.977936 40.593708 5 N -73.977936 40.593708 0.01 0.0 0.0 0.0 450.00 450.31 2 2015-08-10T01:02:28.000 2015-08-10T01:02:54.000 NaN 1 0.3 POINT (-73.97794 40.59371) 121.0 Gravesend 3 Brooklyn http://nyc.pediacities.com/Resource/Neighborho... Other Most Expensive
1 1 27.10 -74.164146 40.618874 1 N -73.950554 40.808949 78.00 0.0 0.5 266.0 0.00 344.80 2 2015-08-22T12:40:25.000 2015-08-22T13:53:19.000 NaN 1 0.3 POINT (-74.16415 40.61887) 42.0 Bull's Head 5 Staten Island http://nyc.pediacities.com/Resource/Neighborho... Other Most Expensive
2 1 19.27 -73.975105 40.765198 1 N -74.072388 40.902721 145.50 0.0 0.5 92.0 11.75 250.05 2 2015-08-26T10:21:59.000 2015-08-26T14:05:38.000 NaN 1 0.3 POINT (-73.97511 40.76520) 49.0 Central Park 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... Central Park Most Expensive
In [44]:
# view bottom 3 priced rides
gdf.tail(3)
Out[44]:
passenger_count trip_distance pickup_longitude pickup_latitude ratecodeid store_and_fwd_flag dropoff_longitude dropoff_latitude fare_amount extra mta_tax tip_amount tolls_amount total_amount vendor_id pickup_datetime dropoff_datetime rate_code payment_type imp_surcharge geometry index_right neighborhood boroughCode borough X.id neighbor_recode cost_tier
49251 1 0.0 -73.950691 40.742889 5 N 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 1 2015-08-15T22:45:04.000 2015-08-15T22:45:04.000 NaN 2 0.0 POINT (-73.95069 40.74289) 174.0 Long Island City 4 Queens http://nyc.pediacities.com/Resource/Neighborho... Other NaN
49252 1 0.0 -74.001907 40.734409 5 N 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 1 2015-08-14T02:30:29.000 2015-08-14T02:30:29.000 NaN 2 0.0 POINT (-74.00191 40.73441) 297.0 West Village 1 Manhattan http://nyc.pediacities.com/Resource/Neighborho... West Village NaN
49253 2 0.0 -73.936996 40.764622 1 N -73.936813 40.764709 0.0 0.0 0.0 0.0 0.0 0.0 2 2015-08-04T10:00:59.000 2015-08-04T10:02:51.000 NaN 2 0.0 POINT (-73.93700 40.76462) 174.0 Long Island City 4 Queens http://nyc.pediacities.com/Resource/Neighborho... Other NaN

Observations: Looking at the extremes of total_amount reveals what are likely data errors. For instance, in the most expensive ride, trip_distance is 0 but tolls_amount is 450. In the bottom three rides, dropoff_longitude and dropoff_latitude are both 0, which does not make any sense.

I took a closer look at the codebook, and there's a variable called payment_type that indicates how the passenger paid for the ride. In the future, I would use this column to filter out potential errors, perhaps focusing only on credit card data.
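
A minimal sketch of that filter, assuming the TLC codebook convention that payment_type 1 is credit card and 2 is cash (toy data for illustration):

```python
import pandas as pd

rides = pd.DataFrame({
    'payment_type': [1, 2, 1, 2],
    'tip_amount':   [2.5, 0.0, 3.0, 0.0],
})

# keep only credit-card rides; cash tips are typically not recorded,
# so this also drops many artificial $0 tips
card_rides = rides[rides['payment_type'] == 1]

print(len(card_rides))  # 2
```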
